Digital systems as presently implemented are mostly unsafe.
“Silent” errors occur throughout the system. 


These may be bit flips in data, in addresses, in read/write 
commands, or elsewhere. 


Error detection is often lacking, is ignored, or leads to system 
malfunction. “Fault-tolerant” systems attempt to mitigate the 
effects of detected errors, but they fail to yield correct results. 


An estimate of the lower bound on the silent error rate is about 
0.64 checksum mismatches per 100 drives per month.(4,5)


The result is that stored data are silently and irreversibly lost, and 
incorrect computations are treated as correct. 


The Problem 


Origins 


• This problem has been a part of digital systems since their
  beginnings.

• Von Neumann discussed the theory in his 1952 Caltech lecture:
  “Probabilistic Logics and the Synthesis of Reliable Organisms
  from Unreliable Components.”(")

• J. Presper Eckert recognized the practical problem and
  implemented checking (mainly duplicate comparison) in EDVAC.
  See: Section 8 in(?) and().
• The defining property of digital systems is exact bit reproducibility.
  This is achieved by means of restoring logic.

• It is sometimes assumed that a digital representation of some
  information exactly preserves the original. This is never correct.

• There is no reliable measure of “distance” in digital systems.

• Other human activities have localized and connected effects.

• The storage and arrangement of digital information can result in
  changes which are totally unrelated to the intended result.

• Current error checking is “local” and does not usually prevent
  corruption.
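
As a rough illustration of the missing notion of “distance” noted above, the sketch below flips a single bit in the IEEE 754 double-precision encoding of a number. The helper name flip_bit is hypothetical and not part of the original material; the only point is that a one-bit change may produce a neighbouring value, a sign change, or infinity, depending on which bit is hit.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0 = least significant) of its 64-bit encoding inverted."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (damaged,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return damaged


if __name__ == "__main__":
    # Same original value, one flipped bit each time, very different results.
    for bit in (0, 52, 62, 63):
        print(f"bit {bit:2d}: 1.0 -> {flip_bit(1.0, bit)!r}")
```

Nothing in the damaged value itself indicates that it is wrong, which is why the check has to be carried alongside the data.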


Examples of Error Checking 


• IBM disk drives circa 1968. (Address parity check.)
• Magnetic tape transfers: IBM and GE. (Incomplete, off by one.)
• FTP transfers. (Incomplete, index errors.)
• Memory parity checking. (Ignored.)
• ECC checking. (Unmanaged.)
• The Unix/Linux “anything goes” principle.


In code that has been examined, the error density is highest in the 
error handling code. 
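
For the memory parity item in the list above, a minimal sketch of what a single parity bit can and cannot detect. The function names are assumptions made for illustration. Any odd number of flipped bits is caught; any even number passes unnoticed, and a caught error helps only if the caller acts on it.

```python
def parity_bit(byte: int) -> int:
    """Even-parity bit for an 8-bit value: 1 if the number of set bits is odd."""
    return bin(byte & 0xFF).count("1") & 1


def store(byte: int) -> tuple[int, int]:
    """Model a memory cell as (data byte, parity bit)."""
    return byte & 0xFF, parity_bit(byte)


def check(byte: int, parity: int) -> bool:
    """True if the stored parity still matches the data."""
    return parity_bit(byte) == parity


if __name__ == "__main__":
    data, parity = store(0b1011_0010)
    print("clean:      ", check(data, parity))
    print("1 bit flip: ", check(data ^ 0b0000_0100, parity))
    print("2 bit flips:", check(data ^ 0b0001_0100, parity))
```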
Design Principle


Design Principle (Containment Strategy) 


• All errors must be detected and resolved within a defined scope
  and before data propagation which may cause corruption.

• Note: The currently pervasive filesystem designs, based on the
  Unix top-down approach, are inherently unsafe. Filesystems based
  on continuous consistency and self-identifying records have been
  implemented. The conversion to such a filesystem will be
  necessary at some time, as will some form of monotonic database.
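
One way to picture a self-identifying record, offered only as a sketch under stated assumptions: each record carries its own identity and a checksum, and the reader refuses to hand data onward unless both match. The layout (64-bit id, 32-bit length, CRC-32 trailer) and the names seal, open_record and IntegrityError are hypothetical and are not taken from any particular filesystem.

```python
import struct
import zlib

# Hypothetical header layout: 64-bit record id, 32-bit payload length (little-endian).
HEADER = struct.Struct("<QI")
TRAILER = struct.Struct("<I")  # CRC-32 over header + payload


class IntegrityError(Exception):
    """Raised instead of returning data that failed a check."""


def seal(record_id: int, payload: bytes) -> bytes:
    """Wrap a payload so that it states which record it is and can be verified."""
    body = HEADER.pack(record_id, len(payload)) + payload
    return body + TRAILER.pack(zlib.crc32(body))


def open_record(expected_id: int, blob: bytes) -> bytes:
    """Verify a sealed record before any of its data is propagated."""
    if len(blob) < HEADER.size + TRAILER.size:
        raise IntegrityError("record truncated")
    body = blob[:-TRAILER.size]
    (stored_crc,) = TRAILER.unpack(blob[-TRAILER.size:])
    if zlib.crc32(body) != stored_crc:
        raise IntegrityError("checksum mismatch")
    record_id, length = HEADER.unpack_from(body)
    if record_id != expected_id:
        raise IntegrityError("record identifies itself as a different record")
    if length != len(body) - HEADER.size:
        raise IntegrityError("length field disagrees with payload size")
    return body[HEADER.size:]
```

The identity check is what catches a correct record delivered from the wrong place, for example after an address error; a checksum over the payload alone would accept it.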
A Solution


Attributes of a Solution 


• Data through checking: no unchecked transfers. No “single valid
  copy.”

• All staticized data covered by error correction codes.

• Inhibit propagation of any error by check before transfer.

• Precise error reporting to define and isolate failing components.

• Proof of correctness by full fault injection.

• Improved control software that matches fully checked hardware
  (VM-based).
• Current digital systems are unsafe.

• Data are being silently corrupted at a high rate (> 0.64 checksum
  mismatches/100 drives/month).(4,5)

• No viable mechanisms are in place to detect or recover from the
  corruption.

• A recent ISO Standard (T10 Data Integrity Field) provides an
  industry standard for dealing with the SCSI part of this problem.
  Much more must be defined and made standard. T10 is in the
  current (3.4) Linux kernel, provided by Martin Petersen.(©”) Disk
  manufacturers are providing the space for CRC in sectors
  (512B+8 or 4KB+8).
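
To make the “512B+8” figure concrete, here is a minimal sketch of the 8 extra bytes as commonly described for T10 DIF: a 16-bit guard tag (a CRC of the sector data using polynomial 0x8BB7), a 16-bit application tag, and a 32-bit reference tag. Using the low 32 bits of the logical block address as the reference tag is an assumption here (it corresponds to what is usually called Type 1 protection), and the function names are hypothetical. This is a bit-at-a-time software model, not kernel or device code.

```python
import struct

T10_DIF_POLY = 0x8BB7                 # CRC-16 polynomial used for the guard tag
TUPLE = struct.Struct(">HHI")         # guard tag, application tag, reference tag


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append the 8-byte protection tuple to a sector (for example 512 -> 520 bytes)."""
    return sector + TUPLE.pack(crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify(block: bytes, lba: int) -> bytes:
    """Check guard and reference tags; return the sector data only if both match."""
    sector, tail = block[:-TUPLE.size], block[-TUPLE.size:]
    guard, _app_tag, ref = TUPLE.unpack(tail)
    if guard != crc16_t10dif(sector):
        raise ValueError("guard tag mismatch: sector data corrupted")
    if ref != lba & 0xFFFFFFFF:
        raise ValueError("reference tag mismatch: block is from the wrong address")
    return sector
```

The guard tag addresses bit flips in the data, while the reference tag addresses misdirected reads and writes, which a data checksum alone cannot reveal.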


Summary 


• It is not sufficient to claim reliability or correctness: this must be
  demonstrable at any time.